Add prepare command #38
Conversation
Note to self:
tests/test_memmap_dataset.py
```python
from fast_llm.data.gpt.memmap import GPTMemmapDataset
import pytest


def dtype_arrays(dtype: np.dtype, min_size: int = 1, max_size: int = 100) -> st.SearchStrategy:
```
I'm not following what the hypothesis module brings here. You seem to be just creating a list of random arrays, is that right? This can easily be done in plain numpy with the same function complexity.
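For reference, a plain-numpy version of such a generator might look like the following (an illustrative sketch, not code from this PR; the function name and element ranges are hypothetical):

```python
import numpy as np

# Hypothetical plain-numpy equivalent (not from this PR): build a list of
# random 1-D arrays of the given dtype up front, with a fixed seed.
def random_arrays(dtype: np.dtype, min_size: int = 1, max_size: int = 100, seed: int = 0) -> list[np.ndarray]:
    rng = np.random.default_rng(seed)
    num_arrays = int(rng.integers(min_size, max_size + 1))
    return [
        rng.integers(0, 1000, size=int(rng.integers(1, 100))).astype(dtype)
        for _ in range(num_arrays)
    ]
```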
The benefit is that hypothesis will try to shrink the inputs to a minimal reproducing example in case of a problem.
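As an illustration of that point, a strategy plus `@given` test could look like this (a sketch assuming `hypothesis` with its numpy extra; the PR's actual strategy and test may differ):

```python
import numpy as np
from hypothesis import given, strategies as st
from hypothesis.extra.numpy import arrays


# Illustrative strategy: lists of random 1-D arrays of the given dtype.
def dtype_arrays(dtype: np.dtype, min_size: int = 1, max_size: int = 100) -> st.SearchStrategy:
    return st.lists(
        arrays(dtype=dtype, shape=st.integers(min_value=1, max_value=1000)),
        min_size=min_size,
        max_size=max_size,
    )


@given(documents=dtype_arrays(np.dtype(np.int32)))
def test_dtype_preserved(documents: list[np.ndarray]):
    # If this assertion fails, hypothesis re-runs the test on progressively
    # smaller inputs and reports a minimal failing list of arrays.
    for document in documents:
        assert document.dtype == np.int32
```

With plain-numpy random inputs, a failure reports whatever large random batch happened to trigger it; automatic shrinking to a small counterexample is what the strategy-based formulation buys.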
LGTM, assuming my proposed modifications are ok
✨ Description
Extracted and refined the dataset preparation script from #17.
Made it a command like `train` or `convert`. Example call and config:

or

where `foo.yaml` contains:

Run `git clone https://huggingface.co/HuggingFaceTB/SmolLM-135M` in `tmp` to get that tokenizer file.

This will produce:

with `fast_llm_dataset.json` reading:

```json
{
  "datasets": [
    {
      "prefix": "shard_0_0",
      "num_documents": 10000,
      "num_tokens": 11569536,
      "weight": 1.0
    }
  ]
}
```

The `downloaded_dataset` can be deleted afterwards. It is not used by Fast-LLM.
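For illustration, a small standalone snippet (hypothetical, not part of this PR) that reads the generated index and summarizes it, assuming only the JSON layout shown above:

```python
import json
from pathlib import Path

# Hypothetical helper: summarize the index written by the prepare command.
def summarize_index(path: Path) -> None:
    index = json.loads(path.read_text())
    for dataset in index["datasets"]:
        print(f"{dataset['prefix']}: {dataset['num_documents']} documents, "
              f"{dataset['num_tokens']} tokens, weight {dataset['weight']}")
    print("total tokens:", sum(d["num_tokens"] for d in index["datasets"]))

summarize_index(Path("fast_llm_dataset.json"))
```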
🔍 Type of change
Select all that apply:
📝 Changes
- `prepare_dataset` command
- Dockerfile

✅ Checklist
Make sure the following tasks are completed before submitting the PR:
General:
Dependencies and Configuration:
Testing:
Performance Impact:
📊 Performance Impact Details
N/A
📝 Additional Notes
N/A